(54) Speech processor

(57) In a speech processor such as a speech recogniser, the problem of detecting the beginning and end of
speech or a word accurately, to enable the creation of a speech or word template which consistently matches
stored speech or word templates, is solved by characterising background noise and forming a background noise
template, setting a speech threshold above which speech is detected and stored, and subtracting the
background noise template from the stored speech to form a speech template.
Description



This invention relates to speech processors, and in particular to speech recognisers which are able to detect the 
beginning and end of input speech. 

Automatic speech recognisers work by comparing features extracted from audible speech signals. Features
extracted from the speech to be recognised are compared with stored features extracted from a known utterance.

For accurate recognition it is important that the features extracted from the same word or sound when spoken at
different times are sufficiently similar. However, the large dynamic range of speech makes this difficult to achieve,
particularly in areas such as hands-free telephony where the sound level received by the microphone can vary over a
wide range. In order to compensate for this speech level variation, most speech recognisers use some form of
automatic gain control (AGC).

The AGC circuit controls the gain to ensure that the average signal level used by the feature extractor is as near
constant as possible over a given time period. Hence quiet speech utterances are given greater gain than loud
utterances. This form of AGC performs well when continuous speech is the input signal, since after a period of time the
circuit gain will optimise the signal level to give consistent feature extraction. However, in the absence of speech the
gain of the AGC circuit will increase to a level determined by the background noise, so that at the onset of a speech
utterance the gain of the AGC circuit will be set too high. During the utterance the gain of the circuit is automatically
reduced, the speed of the gain change being determined by the 'attack' time of the AGC. The start of the utterance is
thus subjected to a much greater gain, and any features extracted will have a much greater energy content than similar
features extracted later, when the gain has been reduced.

This distortion effect is dependent on the input signal level; the higher the speech level, the larger the distortion.
Hence the first few features extracted will not correspond to the notionally similar stored features, and this can often
result in poor recognition performance.

Recognition performance, as well as depending on how the AGC deals with noise level, depends on the speech
recogniser's ability to detect the beginning and end of the speech or word with a high degree of accuracy, where the
speech or word will become the subject of a speech template.

The present invention seeks to provide a solution to the problem of detecting the beginning and end of the speech 
or word with a high degree of accuracy. 

According to a first aspect, the present invention provides a method of processing speech in which the start and 
end points of a speech sample in an input signal are determined by establishing an initial threshold based on a measure 
of the noise level in said input signal, the initial threshold being used to establish an initial stored speech sample which 
initial speech sample is then further processed using a further threshold level which is at a predetermined level beneath 
a maximum level of said initial speech sample, said further threshold level being used to determine the start and end 
points. 
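
The second stage of the first aspect can be sketched in a few lines of Python. This is an illustration only: the function name `refine_endpoints` and the parameter `drop_ratio` are hypothetical, and the value 0.1 is an arbitrary stand-in for the "predetermined level beneath a maximum level", which the text does not quantify.

```python
from typing import List, Tuple


def refine_endpoints(frame_energy: List[float], drop_ratio: float = 0.1) -> Tuple[int, int]:
    """Return (start, end) indices of the frames at or above the further threshold.

    frame_energy: per-time-slot energy of the initial stored speech sample.
    drop_ratio:   assumed fraction of the maximum that defines the further threshold.
    """
    if not frame_energy:
        raise ValueError("the initial speech sample is empty")
    further_threshold = drop_ratio * max(frame_energy)
    above = [i for i, e in enumerate(frame_energy) if e >= further_threshold]
    return above[0], above[-1]


# Initial sample captured with the noise-based threshold; the quiet edges are trimmed.
initial_sample = [3.0, 4.0, 40.0, 90.0, 100.0, 60.0, 5.0, 2.0]
start, end = refine_endpoints(initial_sample)
print(start, end)  # prints "2 5"
```

Because the further threshold is tied to the peak of the captured sample rather than to the noise floor, the trimmed end points scale with the level of the utterance itself.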

According to a second aspect, the present invention provides a method of detecting speech comprising the steps of:

determining a speech energy threshold for an input channel;
forming a background noise template for the input channel;
sampling the input channel and storing the samples into a first store;
detecting speech by determining when a sample exceeds the speech threshold, and storing the sample and
subsequent samples into a second store;
detecting the end of the speech by determining when a predefined number of the subsequent samples drop
below the threshold;
forming a first speech template by augmenting the samples from the second store with a predefined number of
samples from the first store; and
forming a second speech template by subtracting the background noise template from the first speech template.

In a method of detecting speech according to the second aspect of the present invention, it is necessary to
characterise the background noise before speech is detected, so that the AGC level can be set to cope with the maximum
noise level. A template representative of the background noise is formed and the speech threshold is set to be a
specified amount above the maximum background noise level, so that speech is detected as soon as the speech threshold
is exceeded. Also, the end of the speech is detected by recognising when the noise level drops below a threshold for a
specified interval. Throughout speech detection, the AGC compensates for increasing or decreasing noise levels so
that, ultimately, a speech template can be provided which has constant gain.
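
The steps of the second aspect can be illustrated with the following Python sketch. It is not the patented implementation: the function name `detect_utterance` and the parameters `pre_roll` and `end_run` are hypothetical, their default values are arbitrary, and both the clamping of negative values to zero and the retention of the trailing quiet frames are assumptions that the text neither requires nor excludes.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional

Frame = List[float]  # one energy value per filter channel for one time slot


def detect_utterance(frames: Iterable[Frame], noise_template: Frame, threshold: float,
                     pre_roll: int = 3, end_run: int = 2,
                     history: int = 100) -> Optional[List[Frame]]:
    """Return the noise-subtracted (second) speech template, or None if no speech is seen."""
    first_store: Deque[Frame] = deque(maxlen=history)  # cyclic store of pre-speech frames
    second_store: List[Frame] = []                     # frames from speech onset onwards
    lead_in: List[Frame] = []
    in_speech = False
    quiet = 0
    for frame in frames:
        energy = sum(frame) / len(frame)               # average over the filter channels
        if not in_speech:
            if energy > threshold:                     # speech onset
                in_speech = True
                lead_in = list(first_store)[-pre_roll:] if pre_roll > 0 else []
                second_store.append(frame)
            else:
                first_store.append(frame)
        else:
            second_store.append(frame)
            quiet = quiet + 1 if energy < threshold else 0
            if quiet >= end_run:                       # enough consecutive quiet frames
                break
    if not in_speech:
        return None
    first_template = lead_in + second_store            # augment with pre-onset frames
    return [[max(v - n, 0.0) for v, n in zip(frame, noise_template)]
            for frame in first_template]


frames = [[1.0, 1.0], [1.0, 1.0], [2.0, 2.0], [5.0, 7.0],
          [9.0, 9.0], [2.0, 2.0], [1.0, 1.0], [1.0, 1.0]]
template = detect_utterance(frames, noise_template=[1.0, 1.0], threshold=3.0, pre_roll=2)
print(len(template))  # prints 6
```

Prepending frames from the first store recovers the quiet onset of a word that had not yet crossed the threshold when speech was detected.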

Embodiments of the invention will be further described and explained by way of example only, with reference to the
accompanying drawing, in which:

Figure 1 is a schematic diagram of a speech recogniser according to the present invention. 

Throughout this patent application the invention is described with reference to a speech recogniser utilising
template-matching, but as those skilled in the art will be aware, the invention is equally applicable to any of the
conventional types of speech recogniser, including those using stochastic modelling, Markov chains, dynamic
time-warping and phoneme-recognition.

Speech recognition is based on comparing energy contours from a number (generally 8 to 16) of filter channels. 

While speech is present, the energy spectrum from each filter channel is digitised with an Analogue to Digital (A-D)
converter to produce a template which is stored in a memory.

The initial stage of recognition is known as 'training' and consists of producing the reference templates by speaking
to the recogniser the words which are to be recognised. Once reference templates have been made for the words to be
recognised, recognition of speech can be attempted.

When the recogniser is exposed to an utterance, it produces a test template which can be compared with the
reference templates in the memory to find the closest match.
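
A minimal Python sketch of the closest-match step follows. The text does not say how templates are compared, so the sum of squared differences over the overlapping time slots is purely an assumed distance measure, and the names `distance` and `closest_match` are hypothetical.

```python
from typing import Dict, List

Template = List[List[float]]  # time slots x filter channels


def distance(a: Template, b: Template) -> float:
    """Sum of squared differences over the overlapping slots (assumed measure)."""
    return sum((x - y) ** 2
               for slot_a, slot_b in zip(a, b)
               for x, y in zip(slot_a, slot_b))


def closest_match(test: Template, references: Dict[str, Template]) -> str:
    """Return the label of the reference template nearest to the test template."""
    return min(references, key=lambda word: distance(test, references[word]))


references = {
    "yes": [[1.0, 1.0], [5.0, 5.0]],
    "no": [[4.0, 4.0], [1.0, 1.0]],
}
print(closest_match([[1.0, 2.0], [5.0, 4.0]], references))  # prints "yes"
```

A comparison of this kind is only meaningful if the test and reference templates start and end at consistent points, which is why accurate end point detection matters.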

The fundamental elements of the speech recogniser according to the present invention are shown in Figure 1.
Voice signals received by the microphone 1 and amplified by amplifier 2 are passed to a filter bank 3a. In the filter bank
the voice signals are filtered into a plurality (in this case 16) of frequency bands, and the signals are rectified by rectifier
4. The filtered and rectified signals are smoothed by low pass filters 3b and then sequentially sampled by a multiplexer
5 which feeds the resultant single channel signal to the DAGC circuit 8, which in turn feeds an Analogue to Digital
converter 6 from which the digitised signal stream is passed to the controlling microprocessor 7.

The multiplexer addresses each filter channel for 20 microseconds before addressing the next one. At the end of
each 10 millisecond time slot, each channel's sampled energy for that period is stored. The templates, which are
produced during training or recognition, consist of up to 100 time slot samples for each filter channel.
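
The figures quoted above imply some simple bounds, worked through in the Python fragment below. The constant names are hypothetical, and the scans-per-slot figure assumes back-to-back scanning with no dwell extension, which the text does not guarantee.

```python
CHANNELS = 16      # filter channels (from the text)
DWELL_US = 20      # multiplexer dwell per channel, microseconds (from the text)
SLOT_MS = 10       # time slot length, milliseconds (from the text)
MAX_SLOTS = 100    # maximum time slots per template (from the text)

scan_us = CHANNELS * DWELL_US                        # one pass over all channels
scans_per_slot = SLOT_MS * 1000 / scan_us            # upper bound, assuming continuous scanning
template_values = CHANNELS * MAX_SLOTS               # energy values in a full template
template_duration_s = MAX_SLOTS * SLOT_MS / 1000     # longest utterance a template can hold

print(scan_us, scans_per_slot, template_values, template_duration_s)
# prints "320 31.25 1600 1.0"
```

On these figures a full template holds at most one second of speech.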

The digital AGC operates in the following way. Each time the multiplexer addresses a filter channel, the
microprocessor assesses the channel's energy level to determine whether the A-D converter has been overloaded and hence
that the gain is too high. When the microprocessor determines that the gain is too high it decrements the AGC's gain
by 1 step, which corresponds to a reduction in gain of 1.5dB, and looks again at the channel's energy level. The
multiplexer does not cycle to the next channel until the microprocessor has determined that the gain has been reduced
sufficiently to prevent overloading of the A-D converter. When the multiplexer does cycle to the next filter channel, the
gain of the AGC circuit is held at the new low level unless that level results in the overloading of the A-D converter with
the new channel's energy level, in which case the gain is stepped down as previously described. When the multiplexer
has addressed the final filter channel, the microprocessor normalises the energy levels of all the channels by setting
their gain coefficients (which have been stored together with the energy level information in memory associated with the
microprocessor) to the new minimum established by the microprocessor. In this way a consistent set of features is
extracted independent of the initial input signal gain and any changes in the gain during formation of the template.
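
The behaviour of the digital AGC during one scan can be modelled with the Python sketch below. It is a simplified model under stated assumptions: the 8-bit full-scale value is hypothetical (the text does not give the converter word length), the function name `scan_with_dagc` is invented for illustration, and the decibel figure is treated as an amplitude ratio, in line with the statement later in the text that 6dB corresponds to a factor of two.

```python
from typing import List, Tuple

STEP_DB = 1.5          # gain reduction per step (from the text)
ADC_FULL_SCALE = 255   # assumed 8-bit converter; not stated in the text


def attenuation(steps: int) -> float:
    """Linear factor applied to the signal after `steps` gain decrements."""
    return 10 ** (-STEP_DB * steps / 20)


def scan_with_dagc(channel_levels: List[float], steps: int = 0) -> Tuple[List[float], int]:
    """Scan the channels once; return gain-normalised readings and the final step count."""
    readings = []  # (converter input, steps in force when the channel was read)
    for level in channel_levels:
        # Stay on this channel, stepping the gain down, until the converter is not overloaded.
        while level * attenuation(steps) > ADC_FULL_SCALE:
            steps += 1
        readings.append((level * attenuation(steps), steps))
    # Normalise: re-express every reading at the final (lowest) gain of the scan.
    normalised = [value * attenuation(steps - s) for value, s in readings]
    return normalised, steps


values, final_steps = scan_with_dagc([100.0, 400.0, 200.0])
print(final_steps)  # prints 3
```

After normalisation the ratios between channels match the ratios of the unattenuated inputs, even though the first channel was read before the gain was stepped down.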

The speech recogniser is required to detect the beginning and end of the speech or word with a high degree of 
accuracy. The speech recogniser according to the present invention uses the following technique: 

A

The energy level of the background noise is measured and stored for 32 time slots (at 10 milliseconds a sample)
while simultaneously adjusting (reducing) the gain of the AGC circuit as described above to cope with the
maximum noise energy.

B

The maximum energy sample is found by adding all the filter values for each time slot, dividing by 16 (the number
of filter channels) and multiplying by a gain factor corresponding to the gain of the DAGC circuit, and then
comparing each time slot to find the maximum.

C

The threshold which needs to be exceeded before speech is deemed to be present is set to be equal to 1.5 times
the maximum noise energy determined in Step B.

D

The average noise energy for each filter channel is found and stored (for each channel it is the sum of energies over
all 32 time slots, divided by 32) to establish a noise template.

E

Thereafter, the filter bank is scanned every 10 milliseconds and the data is stored in a temporary cyclic store, of
100 time samples, until the average filter energy exceeds the noise/speech threshold calculated in C.

F 

If the noise/speech threshold is not exceeded after 32 samples, a check is performed to ensure that the gain of the 
A DAGC value of 4 is equivalent to a 6dB attenuation of the signal going into the A/D, hence to calculate the "real"
energy all the filter bank values above would have to be doubled.
Maximum real energy (averaged over all filters) was: 410
Threshold to be exceeded to start/end template recording: 615
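
Steps B to D, together with the figures quoted above, can be checked with a short Python sketch. The function name `characterise_noise` is hypothetical, the decibel-to-linear conversion is an assumption consistent with the 6dB/doubling statement, and the noise template is left in unscaled converter units because the text does not say whether it is gain-corrected.

```python
from typing import List, Tuple

STEP_DB = 1.5  # gain change per DAGC step (from the text)


def characterise_noise(slots: List[List[float]], dagc_steps: int,
                       factor: float = 1.5) -> Tuple[float, float, List[float]]:
    """Return (maximum real energy, speech threshold, per-channel noise template)."""
    gain_factor = 10 ** (STEP_DB * dagc_steps / 20)        # undo the DAGC attenuation
    slot_energy = [sum(slot) / len(slot) * gain_factor for slot in slots]    # step B
    max_energy = max(slot_energy)
    threshold = factor * max_energy                                          # step C
    channels = len(slots[0])
    noise_template = [sum(slot[c] for slot in slots) / len(slots)            # step D
                      for c in range(channels)]
    return max_energy, threshold, noise_template


# A DAGC value of 4 gives 4 * 1.5 = 6 dB, which the text rounds to a factor of two.
print(round(10 ** (STEP_DB * 4 / 20), 3))   # prints 1.995
# The quoted threshold follows from the quoted maximum real energy.
print(1.5 * 410)                            # prints 615.0
```

The second printed value agrees with the threshold of 615 quoted for a maximum real energy of 410.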

Because the invention's primary application is to voice recognition it has been described with reference to that
application. However, as those skilled in the art will be aware, the invention is not only applicable to voice recognition,
but is applicable to practically any situation where voice signals are processed for feature extraction.

The speech processor according to the present invention is particularly suitable for use in applications where
background noise and variations in the level of that background noise are a problem for known speech processors. One
such application is in hands-free telephony, and in particular hands-free telephony involving cellular radio terminals.
Such terminals are frequently used in cars, where it is convenient to use speech recognition to provide hands-free call
connection and dialling. The problem arises, however, that wind, road and engine noise fluctuate widely and make
accurate recognition of speech difficult. Clearly, if speech recognition for hands-free telephony is to be fully acceptable
in this application it is necessary that the recogniser accepts and acts correctly in response to voiced commands in the
presence of background noise, without routinely requiring that the commands be repeated.

The improved accuracy of recognition provided by the present invention is of particular advantage in this
application.
Claims
1. A method of processing speech in which the start and end points of a speech sample in an input signal are
determined by establishing an initial threshold based on a measure of the noise level in said input signal, the initial
threshold being used to establish an initial stored speech sample, which initial speech sample is then further
processed using a further threshold level which is at a predetermined level beneath a maximum level of said initial
speech sample, said further threshold level being used to determine the start and end points.

2. [...] signal to determine when the initial threshold level is exceeded, whereafter an initial reference pattern is
stored, samples of the input signal which immediately precede exceeding of said initial threshold level
being added to the front end of said initial reference pattern.

3. A method as claimed in claim 1 or claim 2, wherein the determination of said further threshold level is carried out
on a signal from which a noise estimate has been subtracted.

4. A method as claimed in any one of the preceding claims, wherein a normalisation step is carried out prior to the
determination of said further threshold level.

5. A method as claimed in any one of the preceding claims, wherein a speech template is produced and stored, the
speech template being truncated to the start and end points determined using said further threshold.

6. A method of processing speech comprising the steps of:

determining a speech energy threshold for an input channel;
forming a background noise template for the input channel;
sampling the input channel and storing the samples into a first store;
detecting speech by determining when a sample exceeds the speech threshold, and storing the sample and
subsequent samples into a second store;
detecting the end of the speech by determining when a predefined number of the subsequent samples drop
below the threshold;
forming a first speech template by augmenting the samples from the second store with a predefined number
of samples from the first store; and
forming a second speech template by subtracting the background noise template from the first speech
template.



7. A method of processing speech as claimed in claim 6 in which the step of determining the speech threshold for the
input channel comprises averaging the background noise of the input channel over a first time period and setting
the speech threshold to be H times greater than the average background noise.

8. A method of processing speech as claimed in claim 6 or claim 7 in which the background noise template comprises
a plurality of consecutive samples of the input channel, the samples being taken when no speech is present.

9. A method of processing speech as claimed in any of claims 6 to 8 in which the first store is a cyclic store.

10. A method of processing speech as claimed in claim 9 wherein the cyclic store comprises 100 addressable store
locations.

11. A method of processing speech as claimed in any of claims 6 to 10 in which the second store is a random access
memory.

12. A method of processing speech as claimed in any of claims 6 to 11 in which the step of forming the first speech
template comprises augmenting the samples from the second store by adding a predefined number of samples from
the first store, being those which directly preceded the first sample which
exceeded the speech threshold, in front of the stored samples in the second store.

13. A method of processing speech as claimed in any of claims 6 to 12 further comprising a step of checking a gain
factor of the system, whereby, if the speech threshold is not exceeded within a predefined number of samples and
all the samples in the first store are more than H times smaller than the speech threshold, the gain factor is

14. A method of processing speech as claimed in any of claims 6 to 13 wherein if the speech threshold is not exceeded
within a predefined number of samples and not all the samples in the first store are more than H times smaller than
the speech threshold, the speech threshold is re-calculated by finding the maximum energy sample over a
preceding predefined number of samples and setting the new speech threshold to be H times greater than the maximum
energy sample.

15. A method of processing speech as claimed in any of claims 6 to 14 in which, if more than a predefined number of
samples were processed prior to speech being detected, the background noise template is re-determined prior to
the formation of the second speech template.

16. A method of processing speech as claimed in any of claims 6 to 15 further comprising a step of forming a third
speech template from all samples in the second speech template which exceed a second threshold, the second
threshold being set at a predetermined level below the maximum energy level of the second speech template.

17. A speech processing apparatus which processes speech according to the method of any one of the preceding
claims.

18. A speech processing apparatus as claimed in claim 17 configured as a speech recognizer. 

19. A speech processing apparatus in which the start and end points of a speech sample in an input signal are
determined according to a method in which:

(i) a measure of noise level prevailing in said input signal is determined;

(ii) a signal threshold level T, greater than the determined level, is established;

(iii) the input signal is sampled and stored to determine when threshold level T is exceeded, whereafter an
initial reference pattern is stored;

(iv) the n samples of the input signal immediately preceding the exceeding of said threshold level T are added
to the front end of said initial reference pattern;

(v) a new threshold level R is derived from the pattern produced in step (iv) and this new threshold is used to
scan a stored signal derived from said initial reference pattern to determine the start and finish points of
speech.

20. A telephony terminal including a speech processor as claimed in claim 17, claim 18 or claim 19. 